ABSTRACT

This paper considers the problem of efficiently answering 
reachability queries over views of provenance graphs, de- 
rived from executions of workflows that may include re- 
cursion. Such views include composite modules and model 
fine-grained dependencies between module inputs and outputs.
A novel view-adaptive dynamic labeling scheme is developed
for efficient query evaluation, in which view specifications
are labeled statically (i.e., as they are created) and
data items are labeled dynamically as they are produced 
during a workflow execution. Although the combination of 
fine-grained dependencies and recursive workflows entails, in
general, long (linear-size) data labels, we show that for a 
large natural class of workflows and views, labels are com- 
pact (logarithmic-size) and reachability queries can be eval- 
uated in constant time. Experimental results demonstrate 
the benefit of this approach over the state-of-the-art tech- 
nique when applied for labeling multiple views. 

1. INTRODUCTION 

The ability to manage workflow provenance is increasingly 
important for scientific as well as business applications. For 
example, if an input to a workflow execution is discovered to 
be incorrect, we may wish to determine whether a particular 
workflow output depends on it and is thus also potentially 
incorrect. Finding efficient techniques to answer such reach- 
ability queries is thus of particular interest. 

However, provenance information can be extremely large, 
so we may wish to provide different views of this information. 
For example, users may wish to specify abstraction views 
which focus user attention on relevant provenance informa- 
tion and abstract away irrelevant details, an idea proposed 
in [8]. Workflow owners may also wish to specify security 
views which can be used to hide private information from 
certain user groups (e.g., sensitive intermediate data and 
module functionality [10]). Provenance views consist of a 
set of composite modules which encapsulate subworkflows. 

Example 1. Figure 1 shows an abstraction of a real-life
scientific workflow collected from the myExperiment repository [19].
It generates atom signatures for individual compounds
given a Structural Data File (SDF) as input (ignore
for now the dashed edges inside modules $M_1$ and $M_2$). In a
high-level view of this workflow, users see only one composite
module, indicated as the big dashed box, with two inputs
($d_1$ and $d_2$) and two outputs ($d_4$ and $d_5$), while modules $M_1$
and $M_2$ and intermediate data $d_3$ are hidden.

Figure 1: Views with Fine-Grained Dependencies

Workflow provenance records not only the order of module
executions but also the dependencies between inputs and outputs of
modules. Therefore, workflow views should explicitly specify 
the input-output dependencies for modules that are exposed 
to users. Previous research [13, 21, 4, 5] has adopted a sim- 
plified provenance model which assumes that every output 
of a module depends on every input, termed black-box de- 
pendencies. However, a more fine-grained provenance model 
captures the fact that the output of a module may depend 
on only a subset of its inputs. 

To understand why fine-grained dependencies are useful, 
consider the two types of views mentioned earlier. In ab- 
straction views, although irrelevant workflow details are hid- 
den inside composite modules, users should still be able 
to see the true dependencies between inputs and outputs 
of composite modules (white-box dependencies). In security
views, however, one may want to hide the true dependencies
between inputs and outputs of certain composite modules in
order to preserve structural or module privacy [10]. To this
end, one may move to somewhere on the spectrum between 
white-box and black-box dependencies (grey-box dependen- 
cies). With grey-box dependencies, additional (false) de- 
pendencies between inputs and outputs may be added. 

Example 2. Returning to Figure 1, fine-grained dependencies
between the inputs and outputs of modules $M_1$ and
$M_2$ are indicated as dashed edges inside the modules. In an
abstraction view, the composite module would be associated
with white-box dependencies, in which one output depends on $d_1$ but
not on $d_2$. However, in a security view, the composite module
could be associated with a grey-box dependency matrix
in which every output depends on every input. Hence, the
answer to the reachability query asking whether that output depends on $d_2$ is
different in the two views.

This paper considers the problem of efficiently answering
reachability queries over views of provenance graphs, of the 
types illustrated above. A common approach for processing 
reachability queries is to label data items so that the reacha- 
bility between any two items can be answered efficiently by 
comparing their labels. Moreover, data items must be la- 
beled dynamically as soon as they are produced during the 
execution, since scientific workflows can take a long time to 
execute and users may wish to query partial executions. 
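
To make the idea of answering reachability by comparing labels concrete, the following minimal sketch shows the classic interval (pre/post-order) labeling of a static tree. It only illustrates the general approach and is not the scheme developed in this paper: it handles neither DAG-shaped runs nor dynamic insertion. The names interval_labels and reaches are hypothetical.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, int]  # (entry counter, exit counter) of a depth-first traversal


def interval_labels(children: Dict[str, List[str]], root: str) -> Dict[str, Label]:
    """Assign every node of a static tree an interval in one depth-first pass."""
    labels: Dict[str, Label] = {}
    entry: Dict[str, int] = {}
    counter = 0
    stack = [(root, False)]  # iterative DFS avoids recursion limits on deep trees
    while stack:
        node, finished = stack.pop()
        if finished:
            labels[node] = (entry[node], counter)
            counter += 1
            continue
        entry[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return labels


def reaches(a: Label, b: Label) -> bool:
    """True iff the node labeled a is an ancestor of (or equal to) the node labeled b."""
    return a[0] <= b[0] and b[1] <= a[1]


tree = {"r": ["x", "y"], "x": ["z"]}
labels = interval_labels(tree, "r")
print(reaches(labels["x"], labels["z"]), reaches(labels["y"], labels["z"]))
```

Each label consists of two counters bounded by 2n for a tree with n nodes, so it takes O(log n) bits, and a query compares four integers. This is the compact-and-efficient target that the paper pursues for the much harder setting of dynamically growing workflow runs.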

In contrast to previous work, we study effective dynamic 
labeling in the context of (1) fine-grained dependencies between
inputs and outputs of modules; and (2) views with
grey-box dependencies. This context introduces several new 
challenges. First, none of the existing dynamic labeling 
schemes applies to fine-grained dependencies, since they all 
rely on a simplified provenance model with black-box depen- 
dencies. Second, due to grey-box dependencies, the answer 
to a reachability query may differ across views. A brute-force
approach to handling multiple views is to label data
items for each view repeatedly and separately. This has two 
drawbacks: (i) large index: for each data item, we must 
maintain one label for each view; and (ii) expensive index 
maintenance: when a new view is added, all existing data 
items must be re-labeled. To address the challenges, more 
effective labeling techniques must be developed. The main 
contributions of this paper are summarized as follows. 

• We propose a formal model based on graph grammars 
which capture a rich class of (possibly recursive) workflows 
with fine-grained dependencies between the inputs and out- 
puts of modules. We then use the model to formalize the 
notion of views. They are defined over the workflow specifi- 
cation and then naturally projected onto its runs (Section 2). 

• To get a handle on the difficulty introduced by fine-grained 
dependencies to the dynamic labeling problem, we prove 
that in general, long (linear-size) labels are required. We 
further show that common restrictions on the workflow spec- 
ification, that sufficed to reduce the label length for black- 
box dependencies [5], are no longer helpful. Nevertheless, 
we identify a large natural class of safe views over strictly 
linear-recursive workflows for which dynamic, yet compact 
(logarithmic-size) labeling is possible (Section 3). 

• Based on this foundation we propose a novel labeling approach
whereby view specifications are labeled statically (i.e.,
as they are created), whereas data items are labeled dynamically
as they are produced during a workflow execution. At
query time, the labeling of the view over which the reach- 
ability query is asked is used to augment the data labels 
to provide the correct answer in constant time. We call 
this a view-adaptive dynamic labeling scheme. It has the 
great advantage that, since data labels are unrelated to any 
view, views can be added/deleted/modified without having 
to touch the data. It is both space-efficient and time-efficient 
relative to the brute-force approach (Section 4).

• Finally, we evaluate the proposed view-adaptive labeling 
scheme over both real-life and synthetic workflows. The ex- 
perimental study demonstrates the superiority of our view- 
adaptive labeling approach over the state-of-the-art tech- 
nique [5] when applied to label multiple views (Section 5). 

Related Work. Before presenting our results, we briefly re- 
view related work. The problem of reachability labeling has 
been studied for different classes of graphs in both static and 
dynamic settings. Ideally, one would like to build compact 
(logarithmic-size) labels which enable efficient (constant) 
query processing. While compact and efficient labeling is
shown to be feasible for static trees [20], when labeling general
directed acyclic graphs (DAGs), any possible scheme
requires linear-size labels even if arbitrary query time is 
allowed [4]. On the other hand, dynamic labeling is also 
much harder than static labeling. [9] shows that even label- 
ing dynamic trees requires linear-size labels. Fortunately, 
although workflow runs can have arbitrarily more complex 
DAG structures than trees, [4, 5] show that knowledge of 
the specification can be exploited to obtain compact and 
efficient labeling schemes for both static and dynamic runs 
derived from a given specification. A more detailed compar- 
ison between existing static and dynamic labeling schemes 
for XML trees [20, 1, 9, 18, 23], for DAGs [15, 24, 22, 16, 
11] and for workflow runs [13, 4, 5] is summarized in [5]. 
However, as mentioned above, none of the existing dynamic 
labeling schemes is applicable to our problem as they neither 
support fine-grained dependencies nor handle views. 

2. MODEL AND PROBLEM STATEMENT 

We present a fine-grained workflow model with white-box 
dependencies in Section 2.1. Based on this model, we define 
views with grey-box dependencies in Section 2.2. Section 2.3 
formulates the view-adaptive dynamic labeling problem. 

2.1 Fine-Grained Workflow Model 

Our workflow model is built upon two concepts: workflow 
specification, which describes the design of a workflow, and 
workflow run, which describes a particular workflow execu- 
tion. We model the structure of a specification as a context- 
free workflow grammar whose language corresponds to ex- 
actly the set of all possible runs of this specification. The 
grammar that we use is similar to [5, 7]. However, previous 
work [17, 13, 21, 5, 7] adopted a simplified provenance model 
which implicitly assumes black-box dependencies: every output
of a module depends on every input. In contrast, this
paper proposes a more fine-grained provenance model which 
captures the fact that an output of a module may depend 
on only a subset of inputs. We call this white-box dependen- 
cies. In particular, our model associates the grammar with 
a dependency assignment that explicitly specifies the depen- 
dencies between inputs and outputs of atomic modules. 

The basic building blocks of our model are modules and 
simple workflows. A module has a set of input ports and a 
set of output ports; and a simple workflow is built up from a 
set of modules by connecting their input and output ports. 

Definition 1. (Module) A module is M = (I, O), where
I is a set of input ports and O is a set of output ports. 

Definition 2. (Simple Workflow) A simple workflow is 
W = (V, E), where V is a multiset of modules and E is 
a set of data edges from an output port of one module to 
an input port of another module. Each data edge carries a 
unique data item that is produced by the former and then 
consumed by the latter. Input ports with no incoming data 
edges are called initial input ports; and output ports with 
no outgoing data edges are called final output ports. 

To simplify the presentation, we assume (1) pairwise
non-adjacent data edges: no two data edges are incident
to the same port; and (2) acyclic simple workflows:
data edges do not form cycles among the modules. Note
that the above two restrictions do not limit the expressive 
power of our model. For (1), adjacent data edges can be 
resolved by introducing dummy modules that distribute or 
aggregate multiple data items. For (2), we will see that loops 
can be implicitly captured by recursive productions. 
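
As an illustration of Definitions 1 and 2 and of the two simplifying assumptions, the following sketch encodes modules and simple workflows and checks that data edges are pairwise non-adjacent and acyclic. The encoding (ports as (instance id, port name) pairs, and the names Module, SimpleWorkflow and satisfies_assumptions) is an assumption made for this example, not notation from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Port = Tuple[str, str]  # (module instance id, port name); a hypothetical encoding


@dataclass(frozen=True)
class Module:
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


@dataclass
class SimpleWorkflow:
    modules: Dict[str, Module]      # instance id -> module (models the multiset V)
    edges: List[Tuple[Port, Port]]  # data edges: (output port, input port)

    def initial_inputs(self) -> Set[Port]:
        """Input ports with no incoming data edge."""
        fed = {dst for _, dst in self.edges}
        every = {(i, p) for i, m in self.modules.items() for p in m.inputs}
        return every - fed

    def final_outputs(self) -> Set[Port]:
        """Output ports with no outgoing data edge."""
        used = {src for src, _ in self.edges}
        every = {(i, p) for i, m in self.modules.items() for p in m.outputs}
        return every - used

    def satisfies_assumptions(self) -> bool:
        # (1) Pairwise non-adjacent: no port is shared by two data edges.
        srcs = [src for src, _ in self.edges]
        dsts = [dst for _, dst in self.edges]
        if len(set(srcs)) < len(srcs) or len(set(dsts)) < len(dsts):
            return False
        # (2) Acyclic: Kahn's algorithm over module instances.
        indegree = {i: 0 for i in self.modules}
        successors: Dict[str, List[str]] = {i: [] for i in self.modules}
        for (u, _), (v, _) in self.edges:
            successors[u].append(v)
            indegree[v] += 1
        ready = [i for i, d in indegree.items() if d == 0]
        visited = 0
        while ready:
            u = ready.pop()
            visited += 1
            for v in successors[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
        return visited == len(self.modules)
```

A workflow that violates assumption (1) would, as described above, be repaired by inserting a dummy module that distributes or aggregates the data items; the sketch only detects the violation.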

Example 3. The top left corner of Figure 2 shows a module
$S$ with two input ports and three output ports, which
are denoted by solid and empty circles, respectively. The top
right corner of Figure 2 shows a simple workflow $W_1$ with six
modules and ten data edges (solid edges; ignore the dashed
edges inside modules for now). $W_1$ has two initial input
ports and three final output ports, which are highlighted by
solid and empty thick arrows, respectively.

To build a new workflow, an existing (simple) workflow 
may be reused as a composite module. This is modeled by a 
workflow production. 

Definition 3. (Workflow Production) A workflow production
is of the form $M \xrightarrow{f} W$, where $M$ is a composite module,
$W$ is a simple workflow and $f$ is a bijection that maps
input ports and output ports of $M$ to initial input ports and
final output ports of $W$, respectively. When $f$ is clear from
the context, we simply denote a production by $M \to W$.

Example 4. In Figure 2, each row defines one or two
productions. For example, the first row defines $S \to W_1$, and
the second row defines $A \to W_2$ and $A \to W_3$. Note that $A$
also appears as a composite module in both $W_1$ and $W_4$. For
simplicity, we assume that for each production $M \to W$, the
(initial) input ports and (final) output ports of $M$ and $W$
are mapped by $f$ from top to bottom as shown in the figure.

The context-free workflow grammar is a natural extension 
of the well-known context-free string grammar, where mod- 
ules correspond to characters, and simple workflows that 
are built up from modules correspond to strings that are se- 
quences of characters. In particular, atomic and composite 
modules correspond to terminals and variables, respectively. 
We also define a start module and a finite set of workflow 
productions. By Definition 3, each production $M \xrightarrow{f} W$
replaces a composite module $M$ with a simple workflow $W$.
The data edges adjacent to $M$ are connected to $W$ based
on the bijection $f$. The language of a context-free workflow
grammar consists of all simple workflows that can be derived 
from the start module and contain only atomic modules. 

Following the standard notations for string grammars,
given a finite set $\Sigma$ of modules, let $\Sigma^*$ denote the set of
all simple workflows that are built up from a multiset of
modules in $\Sigma$. Given two simple workflows $W_1$ and $W_2$, let
$W_1 \Rightarrow_f^* W_2$ denote that $W_2$ can be derived from $W_1$ by applying
a sequence of zero or more productions, and $f$ is a
bijection that maps initial input ports and final output ports
from $W_1$ to $W_2$. Again, $f$ may be omitted for simplicity.

Definition 4. (Context-Free Workflow Grammar) A
context-free workflow grammar (abbr. workflow grammar) is
$G = (\Sigma, \Delta, S, P)$, where $\Sigma$ is a finite set of modules, $\Delta \subseteq \Sigma$
is a set of composite modules (then $\Sigma \setminus \Delta$ is the set of atomic
modules), $S \in \Sigma$ is a start module, and
$P = \{M \to W \mid M \in \Delta, W \in \Sigma^*\}$ is a finite set of workflow productions. The
language of $G$ is $L(G) = \{R \in (\Sigma \setminus \Delta)^* \mid S \Rightarrow^* R\}$.

Example 5. Our running example of a workflow grammar
$G$ is shown in Figure 2. Composite modules are indicated
by uppercase letters and atomic modules by lowercase
letters. Formally, $G = (\Sigma, \Delta, S, P)$, where
$\Sigma = \{S, A, B, \ldots, E, a, b, \ldots, f\}$, $\Delta = \{S, A, B, \ldots, E\}$, and
$P = \{p_1 = S \to W_1,\ p_2 = A \to W_2,\ p_3 = A \to W_3,\ p_4 = B \to W_4,\ p_5 = C \to W_5,\ p_6 = D \to W_6,\ p_7 = D \to W_7,\ p_8 = E \to W_8\}$.
Note that $p_2$ and $p_4$ form a recursion between
$A$ and $B$. $p_6$ forms a self-recursion over $D$, and along
with $p_7$, indicates a loop (sequential execution) over $f$.

Figure 2: Workflow Specification



One possible simple workflow run $R \in L(G)$ is shown in
Figure 3, where the atomic modules in $R$ are denoted by solid
boxes, and the composite modules that are created during the
derivation of $R$ are denoted by dashed boxes. We create a
unique id for each atomic and composite module in $R$ by
appending a distinct number to the module name. $d_1$, $d_2$,
... are unique ids for data items (data edges) in $R$.
For the sake of illustration, we omit details of C:1, C:2 and
C:3, and show details of C:4 in Figure 4. Observe that $R$
can be derived from $S$ by applying a sequence of productions
$p_1, p_2, p_4, p_2, p_4, p_3, p_5, p_6, p_6, p_7, p_8, \ldots$

So far we consider only workflow structure - the way in 
which modules are connected to construct workflows. Next, 
we enrich the model by defining fine-grained dependencies 
between inputs and outputs of atomic modules. Naturally, 
we assume that every input contributes to at least one out- 
put; and every output depends on at least one input. 

Definition 5. (Dependency Assignment) Given a finite
set $\Sigma$ of modules, a dependency assignment to $\Sigma$ is a
function $\lambda$ that, for each module $M = (I, O) \in \Sigma$, defines a
set $\lambda(M)$ of dependency edges from $I$ to $O$, such that
$\forall i \in I, \exists o \in O, (i, o) \in \lambda(M)$; and $\forall o \in O, \exists i \in I, (i, o) \in \lambda(M)$.
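
The two conditions of Definition 5 can be checked directly. The following is a minimal sketch under the assumption that ports are identified by strings and a dependency set is given as (input, output) pairs; the function name is_valid_dependency_set is hypothetical.

```python
from typing import Iterable, Tuple


def is_valid_dependency_set(inputs: Iterable[str],
                            outputs: Iterable[str],
                            deps: Iterable[Tuple[str, str]]) -> bool:
    """Check that every input feeds some output and every output has some input."""
    ins, outs, edges = set(inputs), set(outputs), set(deps)
    # Dependency edges must go from an input port to an output port.
    if any(i not in ins or o not in outs for i, o in edges):
        return False
    every_input_contributes = all(any((i, o) in edges for o in outs) for i in ins)
    every_output_depends = all(any((i, o) in edges for i in ins) for o in outs)
    return every_input_contributes and every_output_depends


# A module with two inputs and two outputs and "straight-through" dependencies.
print(is_valid_dependency_set(["i1", "i2"], ["o1", "o2"], [("i1", "o1"), ("i2", "o2")]))
```

Black-box dependencies are the special case where the dependency set is the full product of inputs and outputs, which always passes this check.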

Finally, combining all the above components, our fine- 
grained workflow model is formalized as follows. 

Definition 6. (Fine-Grained Workflow Model) A
workflow specification is $G^\lambda$, where $G = (\Sigma, \Delta, S, P)$ is a
workflow grammar and $\lambda$ is a dependency assignment to
$\Sigma \setminus \Delta$. The set of all workflow runs w.r.t. $G^\lambda$ is
$L(G^\lambda) = \{R^\lambda \mid R \in L(G)\}$, where $R^\lambda$ is obtained from $R$ by adding
to each module $M$ in $R$ a set $\lambda(M)$ of dependency edges.

Example 6. For the grammar $G$ in Figure 2, we define
a dependency assignment $\lambda$ to all atomic modules (i.e., a,
b, ..., f). The dependency edges introduced by $\lambda$ are shown
in Figure 2 as dashed edges from input ports to output ports
of atomic modules. With both data (solid) and dependency
(dashed) edges, Figures 3 and 4 represent a run $R^\lambda \in L(G^\lambda)$.

Figure 4: Details of Composite Module C:4

In Section 3, we will compare our fine-grained model (i.e., 
with white-box dependencies) to the existing coarse-grained 
model (i.e., with black-box dependencies) [5, 7]. Both are 
grammar-based, but the coarse-grained model is less expres- 
sive, and captures only a subclass of fine-grained workflows. 

Definition 7. (Coarse-Grained Workflows) A workflow
specification $G^\lambda$ is said to be coarse-grained if (1) $\lambda$
is defined such that for any atomic module, every output
depends on every input; and (2) every simple workflow used
by $G$ has a single source module and a single sink module (see footnote 1).

2.2 Views with Grey-Box Dependencies 

A workflow view is constructed over a specification and 
then projected onto its runs. Such an approach is common in
workflows [8, 21, 10] (unlike typical database views that are 
defined via queries), but our work is the first to be based on a 
fine-grained model. Formally, a view is defined by two com- 
ponents. One describes the structure of a view by restricting 
the possible expansions of workflow hierarchy to a subset 
of composite modules. The other specifies the "perceived" 
fine-grained dependencies between inputs and outputs of all 
unexpandable modules in this view. As mentioned in Sec- 
tion 1, for abstraction views, the perceived dependencies al- 
ways reflect the true dependencies, which we call white-box 
dependencies. In contrast, for security views, false depen- 
dencies may be introduced in order to hide private prove- 
nance information, which we call grey-box dependencies. 

Definition 8. (Workflow View) Given a workflow specification
$G^\lambda = (\Sigma, \Delta, S, P)^\lambda$, a view over $G^\lambda$ is defined by a
pair $(\Delta', \lambda')$, where $\Delta' \subseteq \Delta$ is a subset of composite modules
and $\lambda'$ is a new dependency assignment for $\Sigma \setminus \Delta'$. In
particular, $(\Delta, \lambda)$ is said to be the default view over $G^\lambda$.

Remark 1. As will be seen in Section 3.1, from the input-output
dependencies of atomic modules, we can compute
those of composite modules. We thus say that a view $(\Delta', \lambda')$
has white-box dependencies if $\lambda'$ defines the same dependencies
as $\lambda$ does; otherwise, it has grey-box dependencies.

Footnote 1: Condition (2) of Definition 7 ensures black-box dependencies for composite modules.

Figure 5: View of Workflow Specification

A view $U = (\Delta', \lambda')$ defined over a specification $G^\lambda$ produces
a new grammar, denoted $G_{\Delta'}$, by restricting $G$ to
the subset of productions for composite modules in $\Delta'$. Together
with $\lambda'$, it defines a new specification, denoted
$G_U = (G_{\Delta'})^{\lambda'}$, which we call a view of this specification. Similarly,
given a run $R^\lambda \in L(G^\lambda)$, by restricting the derivation of $R$
to only productions for composite modules in $\Delta'$ and using
$\lambda'$, we obtain a view of this run, denoted $R_U = (R_{\Delta'})^{\lambda'}$.

Example 7. Using the specification $G^\lambda$ in Figure 2, we
define a view $U = (\Delta', \lambda')$, where $\Delta' = \{S, A, B\}$. The new
grammar $G_{\Delta'}$ is shown in Figure 5, which contains only the
productions for $S$, $A$ and $B$. Note that $C$ is treated as an
atomic module in this view, which makes $D$, $E$ and $f$ underivable.
Therefore, $\lambda'$ needs to be defined for only atomic
modules $a$, $b$, $c$, $d$, $e$ and $C$. The dependency edges introduced
by $\lambda'$ are shown in Figure 5 as dashed edges. Comparing
with $\lambda$ defined in Figure 2, we observe that $\lambda'(C)$ is
newly defined, $\lambda'(e)$ is changed, and others are unchanged.
Hence, this view introduces grey-box dependencies.

We project this view onto the run $R^\lambda$ in Figures 3 and 4.
Since $C$ is treated as atomic, details of C:1, C:2, C:3
and C:4 (Figure 4) are hidden and $R_{\Delta'}$ has exactly the
structure in Figure 3. However, all the dependency edges
for $R_{\Delta'}$ should be given according to $\lambda'$ as in Figure 5.
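
The structural half of a view, restricting the grammar to the productions of a subset of composite modules, can be sketched as follows. The production bodies below are hypothetical: they only record module names, ignore ports and data edges, and are chosen to be consistent with the recursion structure described in the text rather than copied from Figure 2.

```python
from typing import Dict, List, Set

# composite module -> list of production bodies, each a list of module names
Grammar = Dict[str, List[List[str]]]


def restrict(grammar: Grammar, expandable: Set[str]) -> Grammar:
    """Keep only the productions of composite modules in the view's subset."""
    return {m: bodies for m, bodies in grammar.items() if m in expandable}


def derivable(grammar: Grammar, start: str) -> Set[str]:
    """Modules that can appear in some workflow derived from `start`."""
    seen, todo = {start}, [start]
    while todo:
        module = todo.pop()
        for body in grammar.get(module, []):
            for name in body:
                if name not in seen:
                    seen.add(name)
                    todo.append(name)
    return seen


# Hypothetical bodies for the eight productions p1..p8 (assumption, see lead-in).
G: Grammar = {
    "S": [["a", "A", "b"]],
    "A": [["c", "B"], ["d", "C"]],
    "B": [["A", "e", "C"]],
    "C": [["D", "E"]],
    "D": [["f", "D"], ["f"]],
    "E": [["e"]],
}
view = restrict(G, {"S", "A", "B"})
hidden = derivable(G, "S") - derivable(view, "S")
print(sorted(hidden))  # modules that become underivable in the view
```

With these bodies, C stays derivable but is no longer expandable, so it behaves like an atomic module in the view, and the view's dependency assignment must cover it.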

In the rest of this paper, we may simply denote a specification
by $G$ and a run by $R$, since the original dependency
assignment $\lambda$ is irrelevant to views (i.e., overwritten by $\lambda'$).

2.3 View-Adaptive Dynamic Labeling 

We start with the basic dynamic labeling problem. The 
goal is to assign each data item a reachability label as soon 
as it is produced (dynamically) such that using only the la- 
bels of any two data items, we can quickly decide if one de- 
pends on the other. Two different but related dynamic label- 
ing problems were formulated in [5]. In the execution-based 
problem, atomic modules of a run are generated one-by-one
according to some topological ordering. In the derivation-based
problem, a run is derived from the start module by
applying a sequence of productions. As observed in [5], any
solution for the former also provides a solution for the latter. 
We thus focus only on the derivation-based problem. 

Definition 9. [5] (Dynamic Labeling) A dynamic labeling
scheme for a given specification $G^\lambda$ is $(\phi, \pi)$, where $\phi$ is
a labeling function and $\pi$ is a binary predicate. $\phi$ takes as
input a derivation of a run $R^\lambda \in L(G^\lambda)$, that is, a sequence
of productions that transform the start module $S$ to $R$. Initially,
$\phi$ assigns a label $\phi(d)$ to each input and output $d$ of
$S$. In the $i$th step of the derivation, $\phi$ assigns a label $\phi(d)$
to each new data item $d$ introduced by the $i$th production.
Note that we do not know the production sequence in advance,
but receive the productions online. The assigned labels cannot
be modified subsequently. $\phi$ and $\pi$ are such that for any
derivation of a run $R^\lambda \in L(G^\lambda)$ and any two data items $d_1$
and $d_2$ in $R^\lambda$, $\pi(\phi(d_1), \phi(d_2)) = \text{true}$ iff $d_2$ depends on $d_1$.

In contrast to the previous work [5], this paper studies 
the dynamic labeling problem in more general and useful 
workflow settings. Specifically, we consider (1) fine-grained 
input-output dependencies and (2) views with grey-box dependencies.
Both ingredients entail new challenges, which will
be addressed in Sections 3 and 4, respectively. 

To handle views, we propose in Section 4 a novel view- 
adaptive labeling approach whereby view specifications are 
labeled statically (i.e., as they are created), whereas data 
items are labeled dynamically as they are produced during 
a workflow execution. At query time, the label of the view 
over which the query is asked is combined with the labels 
of relevant data items to provide the correct answer. In 
this framework, since data labels are unrelated to any view 
(view-adaptive), views can be added/deleted/modified with- 
out having to touch the data. It is both space-efficient and 
time-efficient relative to the alternative approach where data 
items are labeled repeatedly and separately for each view. 

Definition 10. (View-Adaptive Dynamic Labeling)
A view-adaptive dynamic labeling scheme for a given specification
$G$ is $(\phi_r, \phi_v, \pi)$, where $\phi_r$ is a labeling function for
runs, $\phi_v$ is a labeling function for view specifications, and $\pi$
is a ternary predicate. Given a derivation of a run $R \in L(G)$,
$\phi_r$ as before assigns a label $\phi_r(d)$ (called data label) to each
data item $d$ as soon as it is produced during the derivation
of $R$. Given a view $U$ over $G$, $\phi_v$ treats $U$ as one object and
assigns a label $\phi_v(U)$ (called view label). $\phi_r$, $\phi_v$ and $\pi$ are
such that for any derivation of a run $R \in L(G)$, any view $U$
over $G$ and any two data items $d_1$ and $d_2$ in $R_U$,
$\pi(\phi_r(d_1), \phi_r(d_2), \phi_v(U)) = \text{true}$ iff $d_2$ depends on $d_1$ w.r.t. $U$.

A (view-adaptive) dynamic labeling scheme is said to be 
compact if for any derivation of a run with n data items, 
it creates data labels of $O(\log n)$ bits. Clearly, it provides
the shortest possible data labels up to a constant factor.

3. FEASIBILITY OF DYNAMIC LABELING 

To address the challenges brought by fine-grained depen- 
dencies, we first consider the basic dynamic labeling problem 
(see Definition 9), where there is only one default view de- 
fined over the specification. Note that the labels created for 
the default view also work for other views with white-box 
dependencies, but not those with grey-box dependencies. 

As a formal analysis, we present in this section a classification
of fine-grained workflows based on the feasibility
of developing (compact) dynamic labeling schemes. In Section 3.1,
we first identify a class of safe workflows, and show
that they are the largest set of workflows that allow dy- 
namic labeling schemes. In Section 3.2, we further iden- 
tify a class of strictly linear-recursive workflow structures 
for which dynamic, yet compact labeling schemes are possi- 
ble. Polynomial-time algorithms are also given to decide if a 
workflow is safe or if its structure is strictly linear-recursive. 

Interestingly, our results show that the common restrictions
on the workflow structure, which sufficed to reduce the
label length for black-box dependencies [5], are no longer
helpful. This formally proves the difficulty introduced by 
fine-grained dependencies to the dynamic labeling problem. 

3.1 Safe Workflows 

Some workflows cannot be labeled on-the-fly even if arbi- 
trary label size is allowed. We illustrate by an example. 



Figure 6: Unsafe Workflow 



Example 8. Consider the specification in Figure 6 with
two productions $S \to a$ and $S \to b$. $d_1$ and $d_2$ are an input
and an output of $S$, respectively. Observe that if $S \to a$ is
applied, then $d_2$ depends on $d_1$; otherwise (if $S \to b$ is applied),
$d_2$ does not depend on $d_1$. Recall from Definition 9
that the labels for $d_1$ and $d_2$ must be assigned before we see
the production, and cannot be modified subsequently. Therefore,
no dynamic labeling scheme exists for this example.

In general, if two simple workflows with only atomic mod- 
ules can be derived from the same composite module, and 
they are inconsistent, in the sense that they have different 
dependencies between initial inputs and final outputs, then 
dynamic labeling is impossible for this specification. Such 
workflows are said to be unsafe, and the others are safe. 

Definition 11. (Safe Workflow) A workflow specification
$G^\lambda = (\Sigma, \Delta, S, P)^\lambda$ is said to be safe if $\forall M \in \Delta$ and
$W_1, W_2 \in (\Sigma \setminus \Delta)^*$ such that $M \Rightarrow^* W_1$ and $M \Rightarrow^* W_2$, $W_1$
is consistent with $W_2$ w.r.t. $\lambda$. Also, $\lambda$ is said to be safe if
$G^\lambda$ is safe; and a view $U$ is said to be safe if $G_U$ is safe.

Remark 2. Safety is a natural restriction on fine-grained 
workflows. It essentially says that for any module, either
atomic or composite, the dependencies between inputs and
outputs are deterministic, in the sense that they can be pre- 
dicted from the specification, and are consistent among all 
possible executions. In particular, by Definition 7, any coarse- 
grained workflow (i.e., with black-box dependencies) is al- 
ways safe. Moreover, it is important to notice that from the 
perspective of data provenance, the output of an aggregate 
function depends on each of its inputs [3], even though the 
output may take the value from only one of its inputs (e.g.,
"max" or "min" functions). Therefore, a workflow that uses
those aggregate functions as modules is still safe. 

Our first result shows that safety characterizes the feasi- 
bility of dynamic labeling for fine-grained workflows. 

Theorem 1. Given any workflow specification $G^\lambda$, there
is a dynamic labeling scheme for $G^\lambda$ iff $G^\lambda$ is safe.

Proof. (Sketch) By Definition 11, unsafe workflows do 
not allow any dynamic labeling schemes. On the other hand, 
the view-adaptive dynamic labeling scheme, which we will 
present in Section 4, can be modified to label arbitrary safe 
workflows, though it may create linear-size data labels. □

It is possible to test in polynomial time if a given specification
$G^\lambda$ is safe. Our algorithm, based on Lemma 1, is briefly
described as follows. We start by defining $\lambda^* = \lambda$ for each
atomic module, and then compute $\lambda^*$ for composite modules
by verifying all the productions. A production $M \to W$ is
said to be verifiable if $\lambda^*$ is already defined for all the modules
in $W$, so that $\lambda^*(M)$ can be computed. The algorithm
reports that $G^\lambda$ is safe if $\lambda^*$ is consistently defined for all
composite modules, and outputs $\lambda^*$ as a by-product.

Lemma 1. (Full Assignment) A workflow specification
$G^\lambda = (\Sigma, \Delta, S, P)^\lambda$ is safe iff there is a unique dependency
assignment $\lambda^*$ to $\Sigma$ (called the full dependency assignment)
such that (1) $\forall M \in \Sigma \setminus \Delta$, $\lambda^*(M) = \lambda(M)$; and
(2) $\forall (M \to W) \in P$, $M$ is consistent with $W$ w.r.t. $\lambda^*$.

Figure 7: Full Dependency Assignment

Example 9. We illustrate the above algorithm using the
specification $G^\lambda$ in Figure 2. Initially, both $p_7 = D \to W_7$
and $p_8 = E \to W_8$ are verifiable. We compute $\lambda^*(D)$ and
$\lambda^*(E)$ by $p_7$ and $p_8$. Once $\lambda^*(D)$ and $\lambda^*(E)$ are defined,
$p_5 = C \to W_5$ and $p_6 = D \to W_6$ become verifiable. We
compute $\lambda^*(C)$ by $p_5$, and verify that $\lambda^*(D)$ computed by
$p_6$ is consistent with the one computed before by $p_7$. We
continue this process until all the productions are verified.
Hence, $G^\lambda$ is safe, and $\lambda^*$ is shown on the top of Figure 7.
Similarly, one can verify that the view $U = (\Delta', \lambda')$ defined
in Example 7 is safe using Figure 5. The full dependency
assignment for $U$ is shown on the bottom of Figure 7. Comparing
the two full assignments in Figure 7, while $B$ gets the
same dependencies, the ones for $S$ and $A$ are different.
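
The verification procedure described above can be sketched as a fixpoint loop. This is a minimal illustration under several assumptions that are not from the paper: ports are numbered by index, a production body is a dictionary with the hypothetical keys "modules", "edges", "inputs" and "outputs", and a composite module whose productions never become verifiable is reported as not verified (returning None). The example grammars are small variants in the spirit of Example 8 with two inputs and two outputs, not the workflows of Figures 2 or 6.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Deps = FrozenSet[Tuple[int, int]]  # (input port index, output port index)


def induced(body: dict, lam: Dict[str, Deps]) -> Deps:
    """Dependencies between initial inputs and final outputs of a simple workflow."""
    succ: Dict[tuple, List[tuple]] = {}
    for inst, name in enumerate(body["modules"]):
        for i, o in lam[name]:                    # dependency edges inside a module
            succ.setdefault(("in", inst, i), []).append(("out", inst, o))
    for (u, o), (v, i) in body["edges"]:          # data edges between modules
        succ.setdefault(("out", u, o), []).append(("in", v, i))
    deps = set()
    for k, (inst, i) in enumerate(body["inputs"]):
        seen, todo = set(), [("in", inst, i)]
        while todo:
            port = todo.pop()
            if port not in seen:
                seen.add(port)
                todo.extend(succ.get(port, []))
        for j, (out_inst, o) in enumerate(body["outputs"]):
            if ("out", out_inst, o) in seen:
                deps.add((k, j))
    return frozenset(deps)


def full_assignment(productions, lam_atomic) -> Optional[Dict[str, Deps]]:
    """Return the full assignment if every production verifies consistently, else None."""
    lam = dict(lam_atomic)
    pending = list(productions)
    progress = True
    while pending and progress:
        progress = False
        for head, body in list(pending):
            if all(name in lam for name in body["modules"]):  # production is verifiable
                deps = induced(body, lam)
                if lam.setdefault(head, deps) != deps:
                    return None                                # inconsistent: unsafe
                pending.remove((head, body))
                progress = True
    return lam if not pending else None


straight = frozenset({(0, 0), (1, 1)})   # atomic module a: outputs follow inputs in order
crossed = frozenset({(0, 1), (1, 0)})    # atomic module b: outputs swap the inputs


def single(name):   # body with one atomic module
    return {"modules": [name], "edges": [],
            "inputs": [(0, 0), (0, 1)], "outputs": [(0, 0), (0, 1)]}


def then_S(name):   # body with an atomic module feeding a recursive S
    return {"modules": [name, "S"], "edges": [((0, 0), (1, 0)), ((0, 1), (1, 1))],
            "inputs": [(0, 0), (0, 1)], "outputs": [(1, 0), (1, 1)]}


lam = {"a": straight, "b": crossed}
print(full_assignment([("S", single("a")), ("S", single("b"))], lam) is None)
```

Each pass either verifies at least one pending production or stops, so the loop runs at most once per production, and each verification is a graph search over the ports of one production body. This is one way to see why a polynomial-time test is plausible.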

3.2 Linear-Recursive Workflow Structures

For safe workflows, we further examine the feasibility of 
developing compact dynamic labeling schemes. First of all, 
a negative result in [5] shows that there is a coarse-grained 
workflow that does not allow any compact dynamic labeling 
scheme. By Definition 7 and Lemma 1, we know that any
coarse-grained workflow is safe. So the negative result also 
applies to the fine-grained model: there is a safe workflow 
that does not allow any compact dynamic labeling scheme. 

Given this, our next goal is to identify safe workflows that 
enable compact dynamic labeling. An elegant characterization
for coarse-grained workflows is proved in [5]: given any
coarse-grained workflow specification $G^\lambda$, there is a compact
dynamic labeling scheme for $G^\lambda$ iff $G$ is a linear-recursive
workflow grammar, which is formally defined as follows.

Definition 12. [5] (Linear-Recursive Workflow Grammar) 
A workflow grammar $G = (\Sigma, \Delta, S, P)$ is said 
to be linear-recursive if $\forall M \in \Delta$ and $W \in \Sigma^*$ such that 
$M \Rightarrow^* W$, $W$ has at most one instance of $M$. 

Note that coarse-grained workflows are only a restricted 
class of (fine-grained) safe workflows. We show here that, in 
the fine-grained model, linear-recursiveness is not enough to 
enable compact dynamic labeling for safe workflows. 



Theorem 2. There is a linear-recursive grammar $G$ and 
a safe dependency assignment $\lambda$ such that any dynamic 
labeling scheme for $G^\lambda$ requires linear-size data labels. 

Figure 8: Counterexample in Proof of Theorem 2 

Proof. (Sketch) Figure 8 gives a counterexample $G^\lambda$ with 
three productions $p_a = S \to W_a$, $p_b = S \to W_b$ and 
$p_c = S \to W_c$, where $G$ is linear-recursive and $\lambda$ is safe. Observe 
that a run $R^\lambda \in L(G^\lambda)$ is derived from the start module $S$ 
by applying an arbitrary sequence of $p_a$ and $p_b$, followed by 
one $p_c$. Both $p_a$ and $p_b$ produce three new data items (data 
edges). We focus only on the dependency edges between the 
first two data items. Observe from Figure 8 that they form 
a binary tree that is created dynamically from left to right: 
if $p_a$ is applied, then the first data item is expanded; 
otherwise (if $p_b$ is applied), the second data item is expanded. 
Using a technique similar to [9], we can prove that labeling 
such a dynamic tree requires linear-size data labels. □ 

Theorem 2 tells us that while fine-grained dependencies 
increase the expressive power of the model, they limit the 
recursive workflow structure that allows compact dynamic 
labeling. We thus identify a natural class of strictly linear- 
recursive workflow grammars for which dynamic, yet com- 
pact labeling is feasible for any safe dependency assignment. 
To define them, we introduce a production graph that de- 
scribes the derivation relationship between modules. 

Definition 13. (Production Graph) Given a workflow 
grammar $G = (\Sigma, \Delta, S, P)$, the production graph of $G$ is a 
directed multigraph $\mathcal{P}(G)$ in which each vertex denotes a 
unique module in $\Sigma$. For each production $M \to W$ in $P$ and 
each module $M'$ in $W$, there is an edge from $M$ to $M'$ in 
$\mathcal{P}(G)$. Note that if $W$ has multiple instances of a module 
$M'$, then $\mathcal{P}(G)$ has multiple parallel edges from $M$ to $M'$. 

Intuitively, every cycle in $\mathcal{P}(G)$ corresponds to a recursion 
in $G$. $G$ is said to be recursive if $\mathcal{P}(G)$ is cyclic. A module 
in $G$ is said to be recursive if it belongs to a cycle in $\mathcal{P}(G)$. 

Definition 14. (Strictly Linear-Recursive Workflow 
Grammar) A workflow grammar $G$ is said to be strictly 
linear-recursive if all the cycles in $\mathcal{P}(G)$ are vertex-disjoint. 

Remark 3. Strictly linear recursion is able to capture 
common recursive patterns that we observed in the myExperiment 
workflow repository [19]. In particular, consider 
two common forms of recursion that we encounter in real-life 
scientific workflows. The first is called the loop execution, 
in which a sub-workflow is repeated sequentially a 
number of times until a certain condition is met. The second 
is called the fork execution, in which multiple copies of a 
sub-workflow are executed in parallel. In scientific workflow 
systems, such as Taverna [14] and Kepler [2], fork execu- 
tions are commonly used to model operations over complex 
data (e.g., "maps" over sets). Both loop and fork executions 
belong to a simple form of strictly linear recursion. 

It is easy to show that every strictly linear-recursive work- 
flow grammar is also linear-recursive, but not vice versa. 

Figure 9: Production Graphs 

Example 10. Figure 9 (left) shows the production graph 
$\mathcal{P}(G)$ for the grammar $G$ in Figure 2 (ignore the number pairs 
on the edges). Observe that $\mathcal{P}(G)$ has two cycles: one 
between A and B and the other (a self-loop) over D. Since they 
are vertex-disjoint, $G$ is strictly linear-recursive. Figure 9 
(right) shows the production graph $\mathcal{P}(G')$ for the grammar 
$G'$ in Figure 8. Since $\mathcal{P}(G')$ has two self-loops that share $S$, 
$G'$ is linear-recursive but not strictly linear-recursive. 

It is possible to test in polynomial time if a given grammar 
$G$ is strictly linear-recursive. The algorithm starts by 
building the production graph $\mathcal{P}(G)$, then, according to 
Definition 14, checks if any two cycles in $\mathcal{P}(G)$ share a vertex. 
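
The test just described can be sketched in Python. This is an illustrative sketch, not the authors' implementation. It assumes a grammar is given as a list of (head, body) pairs, and it relies on the observation that all cycles are vertex-disjoint exactly when no vertex has more than one outgoing edge lying on a cycle. The function names and the two toy grammars are hypothetical.

```python
from collections import defaultdict


def production_graph(productions):
    """Build the production multigraph.

    productions: list of (head, body) pairs, where body lists the modules
    of the simple workflow (assumed input format). Parallel edges are kept
    by repeating the target in the adjacency list.
    """
    graph = defaultdict(list)
    for head, body in productions:
        for module in body:
            graph[head].append(module)
            graph[module]  # touch the key so that every module is a vertex
    return dict(graph)


def reachable_from(graph, start):
    """Vertices reachable from start in zero or more steps (iterative DFS)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_strictly_linear_recursive(productions):
    graph = production_graph(productions)
    reach = {u: reachable_from(graph, u) for u in graph}
    for u, targets in graph.items():
        # An edge u -> v lies on a cycle iff u can be reached back from v.
        cyclic_out_edges = sum(1 for v in targets if u in reach[v])
        if cyclic_out_edges > 1:
            return False  # two distinct cycles pass through u
    return True


# Hypothetical toy grammars (not the ones in Figures 2 and 8).
disjoint = [("S", ["a", "A", "D"]), ("A", ["b", "B"]), ("B", ["c", "A"]),
            ("B", ["c"]), ("D", ["d", "D"]), ("D", ["d"])]
shared = [("S", ["a", "S"]), ("S", ["b", "S"]), ("S", ["c"])]
print(is_strictly_linear_recursive(disjoint))  # cycle A-B and self-loop on D
print(is_strictly_linear_recursive(shared))    # two self-loops share S
```

Computing reachability from every vertex keeps the sketch short and is clearly polynomial; a strongly-connected-components pass would do the same job in linear time.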

The main result of this paper is to show that dynamic, 
yet compact labeling is feasible for strictly linear-recursive 
grammars with any safe dependency assignment. 

Theorem 3. Given any strictly linear-recursive workflow 
grammar $G$, for any safe dependency assignment $\lambda$, there is 
a compact dynamic labeling scheme for $G^\lambda$. 

The following section describes our labeling scheme. 

4. VIEW-ADAPTIVE DYNAMIC LABELING 

This section presents a compact view-adaptive dynamic 
labeling scheme for strictly linear-recursive workflows with 
safe views. The rationale behind our label design is ex- 
plained as follows. Both data labels and view labels encode 
only partial (but orthogonal) reachability information. More 
precisely, a data label encodes only a subsequence of the run 
derivation that creates this data item, while a view label en- 
codes only the fine-grained dependencies that are defined in 
this view. However, a combination of two data labels and 
a view label provides the complete information to infer the 
reachability between the two data items over this view. 

We start with a preprocessing step in Section 4.1. Two in- 
dependent tasks for labeling dynamic runs and labeling safe 
views are described in Sections 4.2 and 4.3, respectively. 
Section 4.4 presents how to efficiently answer queries using 
a combination of data labels and view labels. Finally, Sec- 
tion 4.5 analyzes the quality of our labeling scheme. 

4.1 Preprocessing 

As a preprocessing step, we assign a pair of numbers to 
each edge in the production graph. These pairs serve as 
unique ids for the edges, and will be used later to label runs 
and views. Let $G = (\Sigma, \Delta, S, P)$ be a strictly linear-recursive 
grammar and $\mathcal{P}(G)$ be its production graph. First of all, we 
fix an arbitrary ordering among the productions in $P$, and 
for each production $M \to W$, fix an arbitrary topological 
ordering among the modules in $W$. Let $p_k = M \to W$ 
be the $k$th production in $P$, and $M_i$ be the $i$th module in 
$W$; then we assign the edge from $M$ to $M_i$ in $\mathcal{P}(G)$ a pair 
$(k, i)$. Hereafter, we simply refer to this edge as $(k, i)$. In 
addition, we also fix an arbitrary ordering among all the 
(vertex-disjoint) cycles in $\mathcal{P}(G)$, and for each cycle, fix an 
arbitrary edge as the first edge of the cycle. We denote by 
$C(s)$ the $s$th cycle in $\mathcal{P}(G)$, represented as a list of number pairs. 

Example 11. For the grammar $G$ in Figure 2, the pairs 
of numbers assigned to the edges in $\mathcal{P}(G)$ are shown in 
Figure 9. Note that the productions $p_1, p_2, \ldots, p_8$ are simply 
sorted by their subscripts. In Figure 2, all the modules in 
$W_1$ are sorted topologically as $a \to b \to A \to C \to c \to d$. 
Therefore, the edge from $S$ to $c$ in Figure 9 is assigned $(1, 5)$ 
because $p_1 = S \to W_1$ is the first production, and $c$ is the 
fifth module in $W_1$. Moreover, the two cycles in $\mathcal{P}(G)$ are 
denoted by $C(1) = \{(2, 2), (4, 2)\}$ and $C(2) = \{(6, 2)\}$. 
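
As a rough illustration of this preprocessing step, the following Python sketch assigns the pairs (k, i) and looks up edges of a cycle cyclically. It only uses the body of the first production and the cycle C(1) as stated in Example 11. The function names and the (head, body) input format are assumptions made for the sketch.

```python
def assign_edge_ids(productions):
    """Return {(k, i): (M, M_i)} for the k-th production M -> W and the
    i-th module M_i of W, both counted from 1."""
    edge_ids = {}
    for k, (head, body) in enumerate(productions, start=1):
        for i, module in enumerate(body, start=1):
            edge_ids[(k, i)] = (head, module)
    return edge_ids


def cycle_edge(cycle, t, a):
    """The a-th edge (a >= 1) met when unfolding a cycle from its t-th edge,
    i.e. the pair at 1-based position t + a - 1, wrapping around."""
    return cycle[(t + a - 2) % len(cycle)]


# p_1 = S -> W_1 with the topological order given in Example 11.
productions = [("S", ["a", "b", "A", "C", "c", "d"])]
edge_ids = assign_edge_ids(productions)
print(edge_ids[(1, 5)])

# C(1) as listed in Example 11.
c1 = [(2, 2), (4, 2)]
print([cycle_edge(c1, 1, a) for a in range(1, 5)])
```

Storing a cycle as a plain list makes the wrap-around lookup a single modulo operation, which is what unfolding a recursion repeatedly needs.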

4.2 Labeling Dynamic Runs 

Given a derivation of a run $R \in L(G)$, our goal is to assign 
a data label $\phi_r(d)$ to each data item $d$ in $R$ as soon as it 
is produced. The labeling is based on a tree representation 
for runs, called the compressed parse tree. In contrast to the 
traditional parse tree used for context-free grammars whose 
depth may be proportional to the size of the run, the depth 
of a compressed parse tree is always bounded by the size 
of the specification. We will see later that this property is 
critical to enable compact (logarithmic-size) data labels. 

Definition 15. (Compressed Parse Tree) The compre- 
ssed parse tree for a run R is an ordered tree T(R), where 
each leaf node denotes an atomic module, and each non-leaf 
node denotes either a composite module (called the compos- 
ite node), or a linear recursion (called the recursive node). 
The children of a composite node denote all the modules of a 
simple workflow produced by a production, and are ordered 
by a fixed topological ordering; and the children of a recur- 
sive node denote a sequence of nested composite modules 
obtained by unfolding a cycle in the production graph. 

Figure 10: Compressed Parse Tree 

Example 12. The compressed parse tree $T(R)$ for the 
run $R$ in Figures 3 and 4 is shown in Figure 10 (ignore 
the edge labels), where R:1 and R:2 are recursive nodes. 
Note that A:1, B:1, A:2, B:2, A:3 (children of R:1) are 
obtained by unfolding the cycle between A and B in Figure 9. 
In a standard parse tree, they would be connected in a path. 

Lemma 2. Given a strictly linear-recursive workflow 
grammar $G$, for any derivation of a run $R \in L(G)$, the depth 
of the compressed parse tree $T(R)$ is no greater than $2|\Delta|$, 
where $|\Delta|$ is the number of composite modules in $G$. 

We now describe the dynamic labeling algorithm. Given a 
derivation of a run $R$, we build $T(R)$ in a top-down manner. 
During this process, we label each new edge and use the edge 
labels to construct labels for new data items on-the-fly. We 
next explain the design of data labels step by step. 

Firstly, we describe the label for an edge in $T(R)$. Let $e$ be 
an edge from $u$ to $v$ in $T(R)$. We denote by $\phi_r(e)$ the label 
of $e$. (1) If $u$ is a composite node, then $e$ can be mapped to 
an edge $e'$ in $\mathcal{P}(G)$. Recall from Section 4.1 that each edge 
in $\mathcal{P}(G)$ is uniquely identified by a pair of numbers. Let 
$e' = (k, i)$; then $\phi_r(e) = (k, i)$. (2) Otherwise (if $u$ is a 
recursive node), let $u$ denote the $s$th cycle in $\mathcal{P}(G)$ starting 
from the $t$th edge. This can be determined by the first child 
of $u$. Let $v$ be the $i$th child of $u$; then $\phi_r(e) = (s, t, i)$. 

Secondly, we use a sequence of edge labels to construct 
the label for an input port $i$ in $R$. We denote by $\phi_r(i)$ the 
label of $i$. Suppose $i$ is first created as the $x$th input port of 
a module $M$ during the derivation of $R$, and $M$ is denoted 
by a node $v$ in $T(R)$. Let $e_1, e_2, \ldots, e_l$ be the path from the 
root node to $v$ in $T(R)$; then 
$\phi_r(i) = \{\phi_r(e_1), \phi_r(e_2), \ldots, \phi_r(e_l), x\}$. 
For an output port $o$, $\phi_r(o)$ is defined similarly. 

Finally, we use a pair of input and output port labels to 
construct the label for a data item (data edge) $d = (o, i)$ 
in $R$. We denote by $\phi_r(d)$ the label of $d$; then 
$\phi_r(d) = (\phi_r(o), \phi_r(i))$. Since $o$ and $i$ must be created by the same 
production, $\phi_r(o)$ and $\phi_r(i)$ differ only in the last one or two 
edge labels. The size of $\phi_r(d)$ can be reduced almost by half 
by factoring out the common prefix of $\phi_r(o)$ and $\phi_r(i)$. 

Example 13. The edge labels for the compressed parse 
tree $T(R)$ are shown in Figure 10. E.g., the edge from R:1 
to A:3 is labeled by $(1, 1, 5)$, because R:1 denotes the first 
cycle in the production graph starting from the first edge (see 
Example 11), and A:3 is the fifth child of R:1. Next, we 
label the data items. E.g., consider $d_{21} = (o, i)$ in Figure 4, 
where $o$ is the first output port of b:2, and $i$ is first created as 
the second input port of D:1 (note that $i$ is also the second 
input port of f:1). Then, $\phi_r(d_{21}) = (\phi_r(o), \phi_r(i))$, where 

$\phi_r(o) = \{(1, 3), (1, 1, 5), (3, 2), (5, 1), 1\}$ 
$\phi_r(i) = \{(1, 3), (1, 1, 5), (3, 2), (5, 2), (2, 1, 1), 2\}$ 
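
The prefix factoring mentioned in Section 4.2 can be sketched as follows, using the two port labels of Example 13. The tuple encoding of labels and the function names are assumptions of this sketch, and the paper may store labels differently.

```python
def common_prefix_len(a, b):
    """Number of leading entries on which the two labels agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def data_label(out_label, in_label):
    """Store the shared prefix once, followed by the two differing suffixes."""
    n = common_prefix_len(out_label, in_label)
    return out_label[:n], out_label[n:], in_label[n:]


def expand(label):
    """Recover the full (output port, input port) label pair."""
    prefix, out_suffix, in_suffix = label
    return prefix + out_suffix, prefix + in_suffix


# Port labels of Example 13; the last entry of each is the port index.
phi_o = ((1, 3), (1, 1, 5), (3, 2), (5, 1), 1)
phi_i = ((1, 3), (1, 1, 5), (3, 2), (5, 2), (2, 1, 1), 2)

label = data_label(phi_o, phi_i)
print(label[0])   # shared prefix, stored once
print(label[1:])  # the two suffixes
```

Here the three leading edge labels are shared, so only the short suffixes are stored twice.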

4.3 Labeling Safe Views 

Given a safe view $U = (\Delta, \lambda)$ over $G$, our goal is to create 
a view label $\phi_v(U)$ which can be combined with the above data 
labels to infer reachability over $U$. Using the algorithm in 
Section 3.1, we first compute the full dependency assignment 
$\lambda^*$ by extending $\lambda$ to all the composite modules in $\Delta$. 

Next, we define three functions, $\mathcal{I}$, $\mathcal{O}$ and $\mathcal{Z}$. Recall from 
Section 2.2 that $G_\Delta$ denotes the grammar obtained by 
restricting $G$ to $\Delta$. Let $\mathcal{P}(G_\Delta)$ be the production graph of 
$G_\Delta$; then $\mathcal{P}(G_\Delta)$ is a subgraph of $\mathcal{P}(G)$. Recall from 
Section 4.1 that each edge in $\mathcal{P}(G)$ is uniquely identified by a 
pair of numbers $(k, i)$. The input of $\mathcal{I}$ and $\mathcal{O}$ is an edge in 
$\mathcal{P}(G_\Delta)$, denoted by a pair $(k, i)$. The input of $\mathcal{Z}$ is a pair of 
edges in $\mathcal{P}(G_\Delta)$ of the form $(k, i)$ and $(k, j)$. For simplicity, we 
also denote them by a triple $(k, i, j)$. The output of all three 
functions is a reachability matrix, which is defined next. 

Functions $\mathcal{I}$ and $\mathcal{O}$. Given an edge $(k, i)$ in $\mathcal{P}(G_\Delta)$, let 
$p_k = M \to W$ be the $k$th production in $P$, and $M_i$ be the 
$i$th module in $W$; then (1) $\mathcal{I}(k, i)$ is defined as a reachability 
matrix from the inputs of $M$ to the inputs of $M_i$ (w.r.t. $\lambda^*$); 
and (2) $\mathcal{O}(k, i)$ is defined as a (reversed) reachability matrix 
from the outputs of $M$ to the outputs of $M_i$ (w.r.t. $\lambda^*$). 

Function $\mathcal{Z}$. Given a pair of edges $(k, i)$ and $(k, j)$ in 
$\mathcal{P}(G_\Delta)$, let $p_k = M \to W$ be the $k$th production in $P$, 
and $M_i$ and $M_j$ be the $i$th and $j$th module in $W$, respectively; 
then $\mathcal{Z}(k, i, j)$ is defined as a reachability matrix from 
the outputs of $M_i$ to the inputs of $M_j$ (w.r.t. $\lambda^*$). Note 
that $\mathcal{Z}(k, i, j)$ is an empty matrix (with only false values) if 
$i \ge j$, since $M_i$ and $M_j$ are sorted in topological ordering. 

Finally, $\phi_v(U)$ consists of all the above three functions, 
along with $\lambda^*(S)$ for the start module $S$. That is, 

$\phi_v(U) = \{\lambda^*(S), \mathcal{I}, \mathcal{O}, \mathcal{Z}\}$ 

Basically, the above view label encodes all the fine-grained 
dependency information that is specific to this view and is 
necessary for our decoding algorithm given in Section 4.4. 
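
To make the notion of a reachability matrix concrete, the sketch below computes one matrix of the third kind (outputs of one module to inputs of another) for a single simple workflow. The data model is an assumption of the sketch and is not taken from the paper: `deps[m][p][q]` says whether output q of module m depends on its input p, and `data_edges` lists which output port feeds which input port. The three-module workflow is hypothetical.

```python
def reach_set(edges, start):
    """Ports reachable from start in zero or more steps."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def port_graph(deps, data_edges):
    """Directed graph over ports: dependency edges inside each module plus
    data edges between modules. Ports are 0-based in this sketch."""
    edges = {}
    for m, matrix in deps.items():
        for p, row in enumerate(matrix):
            for q, dep in enumerate(row):
                if dep:
                    edges.setdefault(("in", m, p), []).append(("out", m, q))
    for (m, q), (m2, p) in data_edges:
        edges.setdefault(("out", m, q), []).append(("in", m2, p))
    return edges


def z_matrix(deps, data_edges, mi, mj):
    """Rows: outputs of module mi. Columns: inputs of module mj."""
    edges = port_graph(deps, data_edges)
    n_out, n_in = len(deps[mi][0]), len(deps[mj])
    rows = []
    for q in range(n_out):
        seen = reach_set(edges, ("out", mi, q))
        rows.append([("in", mj, p) in seen for p in range(n_in)])
    return rows


# Hypothetical simple workflow with modules b, x, c in topological order.
deps = {
    "b": [[True, True]],      # 1 input, 2 outputs
    "x": [[True]],            # 1 input, 1 output
    "c": [[True], [False]],   # 2 inputs, 1 output
}
data_edges = [(("b", 0), ("x", 0)), (("x", 0), ("c", 0)), (("b", 1), ("c", 1))]

print(z_matrix(deps, data_edges, "b", "c"))
print(z_matrix(deps, data_edges, "c", "b"))  # against the topological order
```

The second call returns only false values, which matches the remark that the matrix is empty when the first module does not precede the second in the topological order.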

Example 14. For the running example, we first label the 
default view $U_1 = (\Delta, \lambda)$ for which $\lambda^*$ is computed in 
Example 9, and is shown on the top of Figure 7. Using $\lambda^*$, 
we can compute the functions $\mathcal{I}$, $\mathcal{O}$ and $\mathcal{Z}$. E.g., consider 
the edge $(1, 5)$ from $S$ to $c$ in Figure 9. The first production 
$p_1 = S \to W_1$ is shown in Figure 2. $\mathcal{I}(1, 5)$ denotes the 
reachability from the inputs of $S$ (i.e., the initial inputs of 
$W_1$) to the inputs of $c$ (i.e., the fifth module in $W_1$); similarly, 
$\mathcal{O}(1, 2)$ denotes the (reversed) reachability from the 
outputs of $S$ (i.e., the final outputs of $W_1$) to the outputs of 
$b$ (i.e., the second module in $W_1$); and $\mathcal{Z}(1, 2, 5)$ denotes the 
reachability from the outputs of $b$ to the inputs of $c$ in $W_1$. 

[Matrices $\mathcal{I}(1, 5)$, $\mathcal{O}(1, 2)$ and $\mathcal{Z}(1, 2, 5)$ for $U_1$: entries not recoverable, except for one row "0 0".] 

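Similarly, we can label the other view $U_2 = (\Delta', \lambda')$ defined 
in Example 7, whose full dependency assignment is shown 
on the bottom of Figure 7. Using Figure 5, we have 

[Matrices $\mathcal{I}(1, 5)$, $\mathcal{O}(1, 2)$ and $\mathcal{Z}(1, 2, 5)$ for $U_2$: entries not recoverable, except for one row "1 0".] 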

As we can see above, the functions encoded by the view 
labels $\phi_v(U_1)$ and $\phi_v(U_2)$ may evaluate to different values for 
the same input. Moreover, they are defined over different 
domains. E.g., $\mathcal{I}(5, 1)$ is defined for $U_1$ but not for $U_2$. 

Space-Efficient View Labeling. By default, we pre-compute 
all the reachability matrices for $\mathcal{I}$, $\mathcal{O}$ and $\mathcal{Z}$, and 
materialize them in the view label. Alternatively, one can 
compute them on-the-fly by performing a graph search over 
the view of a specification during the query time. In gen- 
eral, more sophisticated approaches (e.g., [15, 24, 22]) can 
be used to label the view, in order to find a better balance 
between the overhead of labeling views and query efficiency. 
We will further explore this tradeoff in the experiments. 

4.4 Decoding Data Labels with View Labels 

Using only two data labels $\phi_r(d_1)$ and $\phi_r(d_2)$ and a view 
label $\phi_v(U)$, one can decide if $d_2$ depends on $d_1$ w.r.t. $U$ 
by a decoding predicate $\pi$. We first define in Section 4.4.1 
two procedures used by $\pi$, namely, Inputs and Outputs, and 
then describe $\pi$ in Section 4.4.2. Section 4.4.3 presents fast 
matrix multiplication used to achieve constant query time. 

4.4.1 Procedures Inputs and Outputs 

Let $e$ be an edge from $u$ to $v$ in the compressed parse tree 
$T(R)$. Given the edge label $\phi_r(e)$ (defined in Section 4.2) 
and a view label $\phi_v(U)$, our procedure Inputs computes a 
reachability matrix $\mathrm{Inputs}(\phi_r(e), \phi_v(U))$ by Algorithm 1. 

Case 1. [Line 1 to Line 2] If $\phi_r(e) = (k, i)$, that is, if $u$ 
is a composite node, then Inputs computes a reachability 
matrix from the inputs of the module denoted by $u$ to the 
inputs of the module denoted by $v$, simply given by $\mathcal{I}(k, i)$. 

Case 2. [Line 3 to Line 8] If $\phi_r(e) = (s, t, i)$, that is, if $u$ 
is a recursive node, then $v$ is the $i$th child of $u$. Let $M_1$, 
$M_2, \ldots, M_i$ be the modules denoted by the first $i$ children 
of $u$. They are a sequence of nested composite modules in 
$R_U$ obtained by unfolding the $s$th cycle in $\mathcal{P}(G)$ starting 
from the $t$th edge. Inputs finally computes a reachability 
matrix from the inputs of $M_1$ to the inputs of $M_i$ in $R_U$ by 
multiplying all $i - 1$ intermediate reachability matrices. 

Algorithm 1 Procedure Inputs 
Input: $\phi_r(e) = (k, i)$ or $(s, t, i)$ 
       $\phi_v(U) = \{\lambda^*(S), \mathcal{I}, \mathcal{O}, \mathcal{Z}\}$ 
Output: $\mathrm{Inputs}(\phi_r(e), \phi_v(U))$ 
1: if $\phi_r(e) = (k, i)$ then 
2:   return $\mathcal{I}(k, i)$ 
3: else {$\phi_r(e) = (s, t, i)$} 
4:   let $C(s) = \{(k_1, i_1), (k_2, i_2), \ldots, (k_l, i_l)\}$ 
5:   // $C(s)$ denotes the $s$th cycle in $\mathcal{P}(G)$ of length $l$ 
6:   let $\forall a \ge 1$, $k_{a+l} = k_a$ and $i_{a+l} = i_a$ 
7:   return $\prod_{a=1}^{i-1} \mathcal{I}(k_{t+a-1}, i_{t+a-1})$ 
8: end if 



The other procedure Outputs is defined similarly, which 
computes a (reversed) reachability matrix for output ports. 

Example 15. Let $e$ be the edge from R:1 to A:3 in 
Figure 10 and $U_1$ be the default view. $\phi_r(e) = (1, 1, 5)$ and 
$\phi_v(U_1)$ are explained in Examples 13 and 14. For this pair of 
labels, Algorithm 1 computes the reachability matrix from the 
inputs of A:1 to the inputs of A:3 in $R_{U_1}$. By Example 11, 
the first cycle is $C(1) = \{(2, 2), (4, 2)\}$. Therefore, 

$\mathrm{Inputs}(\phi_r(e), \phi_v(U_1)) = \mathcal{I}(2, 2) \times \mathcal{I}(4, 2) \times \mathcal{I}(2, 2) \times \mathcal{I}(4, 2)$ 
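
A direct, unoptimized reading of Algorithm 1 can be sketched in Python as follows. Boolean matrices are plain nested lists, the table `I` maps a pair (k, i) to its matrix, and `cycles` maps s to the list C(s). The two 2x2 matrices are made-up values chosen only to exercise the code and are not the matrices of the running example.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry [r][c] is True iff a[r][k] and b[k][c]
    hold for some k."""
    return [[any(a[r][k] and b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def inputs(edge_label, I, cycles):
    """Naive version of Algorithm 1 (time linear in i for recursive labels)."""
    if len(edge_label) == 2:            # composite node: label (k, i)
        return I[edge_label]
    s, t, i = edge_label                # recursive node: label (s, t, i)
    cycle = cycles[s]
    first = I[cycle[(t - 1) % len(cycle)]]
    # Start from the identity so that i = 1 yields an empty product.
    result = [[r == c for c in range(len(first))] for r in range(len(first))]
    for a in range(1, i):               # a = 1, ..., i - 1
        pair = cycle[(t + a - 2) % len(cycle)]   # position t + a - 1, wrapped
        result = bool_matmul(result, I[pair])
    return result


# Hypothetical matrices for the two edges of C(1) = {(2, 2), (4, 2)}.
I = {
    (2, 2): [[True, False], [True, True]],
    (4, 2): [[False, True], [True, False]],
}
cycles = {1: [(2, 2), (4, 2)]}

print(inputs((1, 1, 5), I, cycles))  # I(2,2) x I(4,2) x I(2,2) x I(4,2)
```

This version multiplies i - 1 matrices one by one, so it is not constant time; Section 4.4.3 explains how the repetition in the product is exploited.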



4.4.2 Decoding Predicate 

Given a pair of data labels $\phi_r(d_1)$ and $\phi_r(d_2)$ and a view 
label $\phi_v(U) = \{\lambda^*(S), \mathcal{I}, \mathcal{O}, \mathcal{Z}\}$, our goal is to evaluate $\pi$ 
to true iff $d_2$ depends on $d_1$ w.r.t. $U$. Due to space constraints, 
we sketch only the main cases, where both $d_1$ and 
$d_2$ are intermediate data items of $R$. The complete description 
can be found in the full version of this paper [6]. Let 

$\phi_r(d_1) = (\phi_r(o_1), \phi_r(i_1))$ and $\phi_r(d_2) = (\phi_r(o_2), \phi_r(i_2))$; 

then $d_2$ depends on $d_1$ w.r.t. $U$ iff $i_2$ is reachable from 
$o_1$ in $R_U$. Let $\phi_r(o_1) = \{l_1, x\}$ and $\phi_r(i_2) = \{l_2, y\}$, where 
$l_1$ and $l_2$ are two lists of edge labels. Suppose during the 
derivation of $R$, $o_1$ is first created as the $x$th output port 
of some module $M_1$ and $i_2$ is first created as the $y$th input 
port of some module $M_2$. Suppose $M_1$ and $M_2$ are denoted 
by two nodes $v_1$ and $v_2$ in the compressed parse tree $T(R)$. 

Case 1. If $l_1 = l_2$ or one is a prefix of the other, that is, 
$v_1 = v_2$ or one is an ancestor of the other in $T(R)$, then 
$M_1 = M_2$ or one is derived from the other. Thus, $i_2$ is not 
reachable from $o_1$ in $R_U$, and $\pi$ evaluates to false. 

Case 2. Otherwise, suppose $l_1$ and $l_2$ agree on the first $l - 1$ 
edge labels, but differ on the $l$th edge label. Moreover, let 
the lengths of $l_1$ and $l_2$ be $p$ and $q$, respectively. That is, 

$l_1 = \{\phi_r(e_1), \ldots, \phi_r(e_{l-1}), \phi_r(e_l), \ldots, \phi_r(e_p)\}$ 

$l_2 = \{\phi_r(e_1), \ldots, \phi_r(e_{l-1}), \phi_r(e'_l), \ldots, \phi_r(e'_q)\}$ 

where $\phi_r(e_l) \ne \phi_r(e'_l)$. We denote by $v = \mathrm{LCA}(v_1, v_2)$ the 
least common ancestor of $v_1$ and $v_2$ in $T(R)$. Let $e_l$ be an 
edge from $v$ to $v'_1$ and $e'_l$ be an edge from $v$ to $v'_2$. Let $M'_1$ 
and $M'_2$ be the modules denoted by $v'_1$ and $v'_2$, respectively. 

Figure 11: Main Cases of Decoding Predicate 

Case 2a. If $\phi_r(e_l) = (k, i)$ and $\phi_r(e'_l) = (k, j)$, that is, 
$v = \mathrm{LCA}(v_1, v_2)$ is not a recursive node, then we compute 

$O = \prod_{a=l+1}^{p} \mathrm{Outputs}(\phi_r(e_a), \phi_v(U))$ 

$I = \prod_{a=l+1}^{q} \mathrm{Inputs}(\phi_r(e'_a), \phi_v(U))$ 

and $Z = \mathcal{Z}(k, i, j)$. As illustrated by the top right corner of 
Figure 11, $O$ is the (reversed) reachability matrix from the 
outputs of $M'_1$ to the outputs of $M_1$, $Z$ is the reachability 
matrix from the outputs of $M'_1$ to the inputs of $M'_2$, and $I$ is 
the reachability matrix from the inputs of $M'_2$ to the inputs 
of $M_2$. Thus, $O^T \times Z \times I$ gives the reachability matrix from 
the outputs of $M_1$ to the inputs of $M_2$, where $O^T$ denotes 
the transpose of $O$. So $\pi$ evaluates to $(O^T \times Z \times I)[x, y]$. 

Case 2b. If $\phi_r(e_l) = (s, t, i)$ and $\phi_r(e'_l) = (s, t, j)$, that is, 
$v = \mathrm{LCA}(v_1, v_2)$ is a recursive node, we consider the case 
where $i < j$. The other case where $i > j$ can be handled in 
a similar manner. First of all, if $p = l$, that is, $v_1 = v'_1$ and 
$M_1 = M'_1$, then $M_2$ is derived from $M_1$. By Case 1, we know 
that $i_2$ is not reachable from $o_1$ in $R_U$. So $\pi$ evaluates to 
false. Otherwise, as illustrated by the bottom right corner 
of Figure 11, using a similar decoding process to Case 2a, $\pi$ 
evaluates to $(O^T \times Z \times I' \times I)[x, y]$ (see [6] for details). 
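
The following Python sketch walks through Case 1 and Case 2a of the decoding predicate. It is a simplified illustration under stated assumptions: only composite edge labels (pairs) are supported below the divergence point, Case 2b is left out, and the view label is modeled as a dictionary with tables "I", "O" and "Z". The tiny view at the end is hypothetical.

```python
def bool_matmul(a, b):
    """Boolean matrix product."""
    return [[any(a[r][k] and b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def chain(matrices, size):
    """Product of the matrices in order; the identity of that size if empty."""
    result = [[r == c for c in range(size)] for r in range(size)]
    for m in matrices:
        result = bool_matmul(result, m)
    return result


def decode(out_label, in_label, view):
    """Sketch of pi for Case 1 and Case 2a.

    out_label = (edge labels..., x) and in_label = (edge labels..., y),
    with 1-based port indices x and y.
    """
    *l1, x = out_label
    *l2, y = in_label
    n = 0  # number of leading edge labels shared by l1 and l2
    while n < min(len(l1), len(l2)) and l1[n] == l2[n]:
        n += 1
    if n == len(l1) or n == len(l2):
        return False  # Case 1: same node, or ancestor and descendant
    if len(l1[n]) != 2 or len(l2[n]) != 2:
        raise NotImplementedError("Case 2b (recursive LCA) is not sketched")
    (k, i), (_, j) = l1[n], l2[n]
    z = view["Z"].get((k, i, j))
    if z is None:
        return False  # no stored matrix: treated as all-false
    o = chain([view["O"][lab] for lab in l1[n + 1:]], len(z))
    inp = chain([view["I"][lab] for lab in l2[n + 1:]], len(z[0]))
    m = bool_matmul(bool_matmul(transpose(o), z), inp)
    return m[x - 1][y - 1]


# Hypothetical view tables: one O matrix (1 x 2) and one Z matrix (1 x 1).
view = {
    "I": {},
    "O": {(2, 1): [[True, False]]},
    "Z": {(1, 1, 2): [[True]]},
}
print(decode(((1, 1), (2, 1), 1), ((1, 2), 1), view))
print(decode(((1, 1), (2, 1), 2), ((1, 2), 1), view))
print(decode(((1, 1), 1), ((1, 1), (2, 1), 1), view))  # ancestor case
```

Finding the divergence point is a linear scan over two labels whose length is bounded by the specification, which is why the matrix products dominate the query cost.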

4.4.3 Fast Matrix Multiplication 

To achieve constant query time, we need to show that 
Inputs and Outputs can be implemented in constant time. 

Lemma 3. Given a fixed strictly linear-recursive grammar 
$G$, for any edge label $\phi_r(e)$ and any view label $\phi_v(U)$, 
Inputs and Outputs can be computed in constant time. 

Proof. Consider Case 2 in Algorithm 1. First observe 
the repeated pattern of length $l$ in the $i - 1$ intermediate 
reachability matrices. Let $X$ be the product of the 
first $l$ matrices. So we only need to efficiently compute 
$X^{\lfloor (i-1)/l \rfloor}$. Further observe the repeated pattern in the 
sequence $X, X^2, \ldots, X^{\lfloor (i-1)/l \rfloor}$. Suppose any module has at 
most $c$ input or output ports. Note that $c$ is a constant 
for a fixed $G$. Since each matrix has at most $2^{c \times c}$ possible 
boolean values, we can find in constant time $a$ and $b$ such 
that $a < b \le 2^{c \times c} + 1$ and $X^a = X^b$. Once $a$ and $b$ are 
found, $X^{\lfloor (i-1)/l \rfloor}$ can be computed in constant time. □ 
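
The argument in this proof can be illustrated with a short Python sketch: the powers of a boolean matrix are enumerated until one repeats, after which any higher power is obtained by a table lookup. The function names are hypothetical, and the 2x2 swap matrix is only a toy input.

```python
def bool_matmul(a, b):
    """Boolean matrix product."""
    return [[any(a[r][k] and b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def find_repeat(x):
    """Return (a, b, powers) with 1 <= a < b and X^a == X^b, where
    powers[m] == X^m for 1 <= m < b (index 0 is unused)."""
    seen = {}
    powers = [None]
    cur, m = x, 1
    while True:
        key = tuple(map(tuple, cur))
        if key in seen:
            return seen[key], m, powers
        seen[key] = m
        powers.append(cur)
        cur = bool_matmul(cur, x)
        m += 1


def fast_power(n, a, b, powers):
    """X^n for n >= 1, using the fact that powers repeat with period b - a
    from exponent a onwards."""
    if n < b:
        return powers[n]
    return powers[a + (n - a) % (b - a)]


swap = [[False, True], [True, False]]
a, b, powers = find_repeat(swap)
print(a, b)
print(fast_power(10, a, b, powers))
```

Since the number of distinct boolean matrices of a fixed size is finite, the loop in `find_repeat` always terminates, and `fast_power` performs no multiplication at query time.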

Query-Efficient View Labeling. To speed up the query 
processing, one can also pre-compute $a$ and $b$ for each 
recursion in the view, and materialize $a$ and $b$ (as well as 
$X^1, X^2, \ldots, X^b$) in the view label. In contrast to 
space-efficient view labeling (Section 4.3), this is the other extreme 
alternative that will be compared in the experiments. 






4.5 Labeling Scheme Quality Analysis 

We analyze the label length and construction time for both 
data labels and view labels, as well as the query time for 
comparing a pair of data labels and a view label. Note that 
we take the size of a specification as constant [4, 5], and 
measure the complexity in terms of the size of the run. We 
next show that all the above parameters, guaranteed by our 
labeling scheme, are optimal up to a constant factor. 

Theorem 4. Let $(\phi_r, \phi_v, \pi)$ be our view-adaptive dynamic 
labeling scheme for a strictly linear-recursive specification $G$. 

1. logarithmic label length and linear total construction 
time for data labels: for any derivation of a run 
$R \in L(G)$ with $n$ data items and for any data item $d$ in 
$R$, $\phi_r(d)$ has $O(\log n)$ bits, and all data labels can be 
constructed dynamically in a total of $O(n)$ time. 

2. constant label length and constant construction time 
for view labels: for any safe view $U$ over $G$, $\phi_v(U)$ has 
$O(1)$ bits and can be constructed in $O(1)$ time. 

3. constant query time: for any pair of data labels $\phi_r(d_1)$ 
and $\phi_r(d_2)$ and for any view label $\phi_v(U)$, 
$\pi(\phi_r(d_1), \phi_r(d_2), \phi_v(U))$ can be evaluated in $O(1)$ time. 

Proof. (Sketch) Lemma 2 ensures $O(\log n)$ data label 
length. Lemmas 2 and 3 ensure $O(1)$ query time. □ 

User-Defined Views. Our view-adaptive labeling scheme 
can be extended to handle more general types of views, 
where users may create their own composite modules (rather 
than using pre-defined ones) or may hide ports or data edges. 
Details can be found in the full version of this paper [6]. 

5. EXPERIMENTAL EVALUATION 

We now empirically evaluate the effectiveness of our view- 
adaptive labeling approach. Section 5.2 reports the main 
cost of labeling, which is labeling runs. Section 5.3 explores 
the tradeoff between the overhead of labeling views and 
query time by comparing three alternative implementations. 
Section 5.4 demonstrates the superiority of view-adaptive la- 
beling over the state-of-the-art technique [5] when applied to 
label multiple views. Section 5.5 identifies important factors 
that influence the performance of view-adaptive labeling. 

5.1 Experimental Setup 

Real-Life and Synthetic Datasets. Our real-life scientific 
workflows were collected from the myExperiment workflow 
repository [19]. We observed that almost all of them 
have fairly simple recursive patterns. For simplicity, we re- 
port only the results for one representative workflow, called 
BioAID. It is denoted by a strictly linear-recursive grammar 
with 112 modules (16 are composite) and 23 productions (7 
are recursive). Each production produces a simple workflow 
with at most 19 modules, and each module has at most 4 
input ports and 7 output ports. In Section 5.5, we also eval- 
uate a family of synthetic workflows. Due to the absence 
of real workflow executions, we simulated runs by apply- 
ing a random sequence of productions, varying their sizes 
(i.e., the number of data items) from 1K to 32K by a factor 
of 2. The derivations of runs were recorded and used 
as dynamic inputs to labeling schemes. In addition, we ob- 
tained safe views by enumerating all possible proper subsets 
of composite modules and assigning random input-output 
dependencies to atomic modules. All the data are stored as 
XML files whose parsing time is omitted from the results. 



Labeling Schemes. Our view-adaptive dynamic labeling 
scheme is denoted by FVL for (F)ine-grained (V)iew-adaptive 
(L)abeling. We implemented three variants: (1) 
Default FVL (Section 4.3), (2) Space-Efficient FVL 
(Section 4.3), and (3) Query-Efficient FVL (Section 4.4.3). They 
use the same dynamic algorithm to label runs, but differ in 
how views are labeled, which affects query efficiency. We 
also compared FVL with the state-of-the-art scheme, called 
DRL [5], for (L)abeling (D)ynamic runs of (R)ecursive work- 
flows. All the labeling schemes were implemented in Java. 

Evaluation Methodology. To evaluate labeling overhead, 
we measure both label length (space overhead) and construc- 
tion time (time overhead) for data labels and view labels, 
respectively. For data labels, each data point in the result is 
an average over 100 sample runs. We also measure the query 
time. Each data point for query time is an average over $10^6$ 
sample queries. All the experiments were performed on a local 
PC with Intel(R) Core(TM) i7-2600 3.40GHz CPU and 
4GB memory running Windows 7 Professional. 

5.2 Overhead of Labeling Runs 

We first evaluate the overhead of labeling runs using FVL 
and DRL. Note that FVL is view-adaptive: the data labels 
created for one run can be re-used to answer queries over 
all safe views. In contrast, DRL is not view-adaptive: a run 
must be re-labeled for each view. Here, the comparison be- 
tween them focuses on the case where only one default view 
is defined over the workflow. A more meaningful comparison 
for multiple views will be carried out in Section 5.4. 

Figure 12 reports the maximum and average length of 
data labels created by FVL and DRL. We denote them by 
FVL-max, FVL-avg, DRL-max and DRL-avg, respectively. 
A careful analysis of Figure 12 shows that all four lines 
are nearly parallel to the asymptotic line $f(x) = \log x$. This 
implies that both FVL and DRL produce compact data labels 
of logarithmic length with a constant factor close to 
1. Surprisingly, FVL-avg (FVL-max) is even shorter than 
DRL-avg (DRL-max) by about 5 bits. This small improve- 
ment is due to the compact design of data labels in FVL 
which encode only the structure of runs. 

Figure 13 reports the construction time of data labels for 
FVL and DRL. While both build all data labels in linear 
time, FVL is faster than DRL by about 10% for large runs. 

5.3 View Labeling Cost vs. Query Efficiency 

Next, we evaluate the overhead of labeling views as well 
as the query time, and explore the tradeoff between them 
by comparing three variants of FVL: (1) Default FVL pre- 
computes all reachability matrices for the three functions 
X, O and Z, and materializes them in the view label (Sec- 
tion 4.3); (2) Space-Efficient FVL pre-computes only the full 
dependency assignment for each view, and thus any access 
to X, O and Z will be answered by performing a graph search 
over the view of a specification at query time (Section 4.3); 
and (3) Query-Efficient FVL materializes, in addition to X, 
O and Z, all intermediate states of fast matrix multiplica- 
tion for each recursion in the view (Section 4.4.3). 

In the experiments, we label three safe views, namely, 
small view, medium view and large view, with varying sizes 
and random dependency assignments. We estimate the size 
of a view by the number of composite modules that can ex- 
pand. The three views contain 2, 8 and 16 composite mod- 
ules, respectively. Figure 14 shows the length of view labels 
created by all three variants of FVL. As expected, Query- 
Efficient FVL creates the longest labels for all three views. 
However, compared with Default FVL, the extra space overhead 
is small (less than 8 bytes), since views typically have 
a small number of recursions. On the other hand, Space-Efficient 
FVL creates almost no index for each view (less 
than 5 bytes). The results for construction time, not shown, 
reveal a similar trend. While Query-Efficient FVL labels the 
large view in 0.62 ms, Space-Efficient FVL needs only 0.08 
ms. Comparing Figures 12 and 14 also shows that the main 
overhead of FVL lies in the labeling of runs, e.g., the data 
labels for a small run with 1K data items take a total of 
5KB, while the view label created by Query-Efficient FVL 
for the large view takes only 0.4KB. The overall difference 
is even bigger, since for a given workflow, the number of 
runs is typically much greater than the number of views. 

After runs and views are both labeled (independently), 
we generate sample queries by randomly selecting two data 
items in the same run (with varying size) and randomly se- 
lecting one out of the three views. The query time for the 
three variants of FVL is reported in Figure 15. Compared 
to Figure 14, we can see a clear tradeoff between the over- 
head of labeling views and query efficiency. Query-Efficient 
FVL and Default FVL are faster than Space-Efficient FVL 
by almost one order of magnitude. Query-Efficient FVL is 
also significantly faster than Default FVL (by about 40% for 
large runs), while as shown in Figure 14, it takes only small 
extra space overhead (less than 2% for the large view). 



Finally, we note that all three variants of FVL 
achieve constant view label length and constant query time, 
in terms of the size of the run. In other words, there is only 
a constant tradeoff between space and time for the three 
approaches. Therefore, Query-Efficient FVL is preferable 
to the other two variants, since it enables the fastest query 
processing with little extra labeling overhead. All the above 
results also validate our complexity analysis in Theorem 4. 

5.4 Advantage of View-Adaptive Labeling 

We now compare FVL against DRL when multiple views 
are defined over the same workflow. Since DRL applies only 
to the coarse-grained model with black-box dependencies, 
to make a meaningful comparison we randomly generate 10 
medium-size views with black-box dependencies. 

First, we compare the labeling overhead of FVL and DRL. 
Our focus is on the overhead of labeling runs, which is the 
main cost. We fix the size of runs to be 8K (data items), and 
vary the number of views from 1 to 10. Figure 16 shows the 
total length of data labels assigned to one data item. Since 
FVL is view-adaptive, the data label created for one data 
item can be re-used to query over multiple views. Therefore, 
in Figure 16, the total length for FVL remains constant. In 
contrast, given a data item, DRL has to maintain one data 
label for each view separately. So in Figure 16, the total 
length for DRL grows linearly with the number of views. 
A similar result for the total construction time can be 
observed in Figure 17. Note that DRL is faster than FVL 
for one view, since DRL labels the medium-size view of a 
run, which is smaller than the original run. However, when 
there are more than 3 views, FVL is more time-efficient. 
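
The trend in Figures 16 and 17 can be summarized by a simple cost model, sketched below. The model is an assumption made for illustration: it treats the per-view cost of DRL as identical across views and the cost of FVL as a one-time cost per run. The function names are hypothetical and no measured values are encoded.

```python
import math

def drl_total(per_view_cost, num_views):
    """DRL labels each view of a run separately, so cost scales with the views."""
    return per_view_cost * num_views

def fvl_total(one_time_cost, num_views):
    """FVL labels the run once; the same data labels serve every view."""
    return one_time_cost

def break_even_views(fvl_cost, drl_per_view_cost):
    """Smallest number of views at which FVL is strictly cheaper than DRL."""
    return math.floor(fvl_cost / drl_per_view_cost) + 1
```

Under this model, a one-time FVL cost between 3 and 4 times the per-view DRL cost would put the break-even point at 4 views, which matches the shape of the reported behavior.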

We next compare the query time of FVL and DRL. In 
order to achieve a fair comparison, we take the most query- 
efficient variant of both FVL and DRL. Since our compar- 
ison can only use coarse-grained views, many of the reach- 
ability matrices involved in the decoding of FVL are com- 
plete matrices (i.e., with only true values). So we also im- 
plemented a simplified version of FVL, called Matrix-Free 
FVL, which is optimized for coarse-grained views by avoid- 
ing redundant matrix multiplications in the decoding. 
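
To illustrate why complete matrices make multiplication redundant, the sketch below shows one possible shortcut. It is not the Matrix-Free FVL implementation; the function names and the list-of-lists matrix representation are assumptions. The boolean product of two complete matrices with a non-empty inner dimension is again complete, so it can be produced without evaluating the product.

```python
def bool_matmul(a, b):
    """Boolean matrix product: out[i][j] is True iff some k has a[i][k] and b[k][j]."""
    inner, cols = len(b), len(b[0])
    return [[any(a[i][k] and b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(len(a))]

def is_complete(m):
    """A complete matrix contains only True entries."""
    return all(all(row) for row in m)

def bool_matmul_with_shortcut(a, b):
    """Skip the product when both operands are complete (black-box dependencies)."""
    if len(b) > 0 and is_complete(a) and is_complete(b):
        return [[True] * len(b[0]) for _ in a]
    return bool_matmul(a, b)
```

In this sketch the completeness test itself scans both matrices. A practical variant would more likely store a flag with each matrix when the view is labeled, so that the test costs constant time at query time.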

We evaluate the above three approaches over three coarse- 
grained views with varying sizes. As shown in Figure 18, 
FVL is about 4 times slower than DRL, but by removing 
redundant computations for coarse-grained views, 
Matrix-Free FVL achieves almost the same query time as DRL. 

5.5 Important Factors 

Finally, we examine the effectiveness of FVL over a variety 
of synthetic workflows. The goal is to identify factors that 
affect FVL. In particular, we consider: (1) workflow size: 
the number of modules in a simple workflow (default = 40); 
(2) module degree: the number of input/output ports of a 
module (default = 4); (3) nesting depth: the depth of nested 
composite modules (default = 4); and (4) recursion length: 
the number of composite modules in a recursion (default = 
2). We created a family of synthetic workflows by varying 
each of the four parameters and fixing the rest to be the de- 
fault value. For each workflow, we evaluate (1) the overhead 
of labeling a run R with 8K data items; (2) the overhead 
of labeling a safe view U with all composite modules and 
random dependency assignment; and (3) the query time for 
data items in R over U. Due to space constraints, we show 
only the results for two key factors that affect FVL. 
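
The one-factor-at-a-time design described above can be sketched as follows. The default values are those stated in the text. The dictionary keys, the function name one_factor_sweep, and any value ranges passed to it are hypothetical, since the ranges used in our experiments are not listed here.

```python
# Defaults as stated in the text; the key names are illustrative.
DEFAULTS = {
    "workflow_size": 40,    # modules in a simple workflow
    "module_degree": 4,     # input/output ports of a module
    "nesting_depth": 4,     # depth of nested composite modules
    "recursion_length": 2,  # composite modules in a recursion
}

def one_factor_sweep(defaults, ranges):
    """Vary one parameter at a time, fixing the others at their defaults."""
    configs = []
    for name, values in ranges.items():
        for value in values:
            cfg = dict(defaults)  # copy so the defaults stay untouched
            cfg[name] = value
            configs.append(cfg)
    return configs
```

Each resulting configuration would then drive the generation of one synthetic workflow, for which the run labeling overhead, view labeling overhead, and query time are measured.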

One factor that has high impact on the data label length 
is nesting depth. As shown in Figure 19, the (average) data 
label length created by FVL grows linearly with the nesting 
depth, because the nesting depth determines the depth of 
the compressed parse tree which is used to build data labels. 

Another factor that has high impact on the query time is 
module degree. As shown in Figure 20, the query time for 
Query-Efficient FVL grows almost linearly with the module 
degree. This is mainly because the module degree deter- 
mines the cardinality of reachability matrices, and multi- 
plying large matrices at query time can be expensive. 

6. CONCLUSIONS 

This paper considers the problem of efficiently answer- 
ing reachability queries over views of workflow provenance 
graphs. To this end, we design a novel view-adaptive labeling 
scheme that supports fine-grained dependencies between in- 
puts and outputs of modules and combines static labeling 
of views with dynamic labeling of data items. In partic- 
ular, we identify a natural class of safe views over strictly 
linear-recursive workflows for which dynamic, yet compact 
labeling is feasible. The experimental results demonstrate 
the advantage of our view-adaptive labeling approach over 
the state-of-the-art technique [5] when applied to label mul- 
tiple views. Previous work [12] considers efficient evaluation 
of XPath queries over XML views. Extending our work to 
similarly rich query constructs in the context of workflow 
views is an interesting direction for future research. 